Abstract. Classical list scheduling is a very popular and efficient technique for scheduling
jobs on parallel and distributed platforms. It is inherently centralized. However, with
the increasing number of processors, the cost of managing a single centralized list becomes
prohibitive. A suitable approach to reduce the contention is to distribute the
list among the computational units: each processor has only a local view of the work to
execute. Thus, the scheduler is no longer greedy and standard performance guarantees are
lost.

The objective of this work is to study the extra cost that must be paid when the list is
distributed among the computational units. We first present a general methodology for
computing the expected makespan based on the analysis of an adequate potential function
which represents the load unbalance between the local lists. We obtain an equation on the
evolution of the potential by computing its expected decrease in one step of the schedule.
Our main theorem shows how to solve such equations to bound the makespan. Then, we
apply this method to several scheduling problems, namely unit independent tasks,
weighted independent tasks and tasks with precedence constraints. More precisely,
we prove that the time for scheduling a global workload $W$ composed of independent unit
tasks on $m$ processors is equal to $W/m$ plus an additional term proportional to $\log_2 W$.
We provide a lower bound which shows that this is optimal up to a constant. This result
is extended to the case of weighted independent tasks. In the last setting, precedence task
graphs, our analysis leads to an improvement on the bound of Arora et al (2001). We
finally provide some experiments using a simulator. The distribution of the makespan is
shown to fit existing probability laws. Moreover, the simulations give a better insight into
the additive term, whose value is shown to be around $3 \log_2 W$, confirming the tightness of
our analysis.

1. Introduction

1.1. Context and motivations. Scheduling is a crucial issue when designing efficient
parallel algorithms on new multi-core platforms. The problem consists in distributing the
tasks of an application (which we call the load) among the available computational units and
determining at what time they will be executed. The most common objective is to minimize
the completion time of the latest task to be executed (called the makespan and denoted
by $C_{\max}$). It is a hard and challenging problem which has received a lot of attention during the
last decade (Leung, 2004). Two new books have been published recently on the topic
(Drozdowski, 2009; Robert and Vivien, 2009), which confirms how active the area is.

List scheduling is one of the most popular techniques for scheduling the tasks of a parallel
program. This algorithm was introduced by Graham (1969) and was used with
profit in many further works (for instance the earliest task first heuristic, which extends the
analysis to communication delays, in Hwang et al (1989), for uniform machines in Chekuri
and Bender (2001), or for parallel rigid jobs in Schwiegelshohn et al (2008)). Its principle
is to build a list of ready tasks and schedule them as soon as resources become available.
List scheduling algorithms are low-cost (greedy) algorithms whose performance is not too far from
optimal solutions. Most proposed list algorithms differ in the way they consider the priority
of the tasks for building the list, but they always assume a centralized management
of the list. However, today's parallel and distributed platforms involve more and more
processors. Thus, the time needed for managing such a centralized data structure cannot
be ignored anymore. In practice, implementing such schedulers induces synchronization
overheads when several processors access the list concurrently. Such overheads involve
low-level synchronization mechanisms.



1.2. Related works. Most related works dealing with scheduling consider centralized list
algorithms. However, at execution time, the cost for managing the list is neglected. To our
knowledge, the only approach that takes into account this extra management cost is work
stealing (Blumofe and Leiserson 1999), denoted by WS in short.

Contrary to classical centralized scheduling techniques, WS is by nature a distributed
algorithm. Each processor manages its own list of tasks. When a processor becomes
idle, it randomly chooses another processor and steals some work. To model contention
overheads, processors that request work on the same remote list are in competition and
only one can succeed. WS has been implemented in many languages and parallel libraries
including Cilk (Frigo et al 1998), TBB (Robison et al 2008) and KAAPI (Gautier et al
2007). It has been analyzed in a seminal paper of Blumofe and Leiserson (1999) where they
show that the expected makespan of a series-parallel precedence graph with $W$ unit tasks on
$m$ processors is bounded by $\mathbb{E}[C_{\max}] \le W/m + O(D)$ where $D$ is the critical path of
the graph (its depth). This analysis has been improved in Arora et al (2001) using a proof
based on a potential function. The case of varying processor speeds has been analyzed
in Bender and Rabin (2002). However, in all these previous analyses, the precedence
graph is constrained to have only one source and out-degree at most 2, which does not easily
model the basic case of independent tasks. Simulating independent tasks with a binary tree
of precedences gives a bound of $W/m + O(\log W)$ as a complete binary tree of $W$ nodes
has a depth of $D \le \log_2 W$. However, with this approach, the structure of the binary tree
dictates which tasks are stolen. Our approach achieves a bound of the same order with a
better constant and processors are free to choose which tasks to steal. Notice that there
exist other ways to analyze work stealing where the work generation is probabilistic and that
target steady-state results (Berenbrink et al 2003; Mitzenmacher 1998; Gast and Gaujal
2010).

Another related approach which deals with distributed load balancing is balls into bins
games (Azar et al 1999; Berenbrink et al 2008). The principle is to study the maximum
load when $n$ balls are randomly thrown into $m$ bins. This is a simple distributed algorithm
which is different from the scheduling problems we are interested in. First, it seems hard
to extend this kind of analysis to tasks with precedence constraints. Second, as the load
balancing is done in one phase at the beginning, the cost of computing the schedule is not
considered. Adler et al (1995) study parallel allocations but still do not take into account
contention on the bins. Our approach, like WS, considers contention on the lists.

Some works have been proposed for the analysis of algorithms in data structures and
combinatorial optimization (including variants of scheduling) using potential functions.
Our analysis is also based on a potential function representing the load unbalance between
the local queues. This technique has been successfully used for analyzing convergence to
Nash equilibria in game theory (Berenbrink et al 2007), load diffusion on graphs (Berenbrink
et al 2009) and WS (Arora et al 2001).

1.3. Contributions. List scheduling is centralized in nature. The purpose of this work is
to study the effects of decentralization on list scheduling. The main result is a new framework
for analyzing distributed list scheduling algorithms (DLS). Based on the analysis of
the load balancing between two processors during a work request, it is possible to deduce
the total expected number of work requests and then to derive a bound on the expected
makespan.

This methodology is generic and it is applied in this paper to several relevant variants
of the scheduling problem.

• We first show that the expected makespan of DLS applied to $W$ unit independent
tasks is equal to the absolute lower bound $W/m$ plus an additive term in
$3.65 \log_2 W$. We propose a lower bound which shows that the analysis is tight up
to a constant factor. This analysis is refined and applied to several variants of the
problem. In particular, a slight change in the potential function improves the multiplicative
factor from 3.65 to 3.24. Then, we study the possibility for processors
to cooperate while requesting some tasks in the same list. Finally, we study the
initial repartition of the tasks and show that a balanced initial allocation induces
fewer work requests.

• Second, the previous analysis is extended to the weighted case with arbitrary unknown
processing times. The analysis achieves the same bound as before with an extra
term involving $p_{\max}$ (the maximal value of the processing times).

• Third, we provide a new analysis of the WS algorithm of Arora et al (2001) for
scheduling DAGs that improves the bound on the number of work requests from
$32mD$ to $5.5mD$.

• Fourth, we developed a complete experimental campaign that gives statistical evidence
that the makespan of DLS follows known probability distributions depending
on the considered variant. Moreover, the experiments show that the theoretical
analysis for independent tasks is almost tight: the overhead to $W/m$ is less than
37% away from the exact value.

1.4. Content. We start by introducing the model and we recall the analysis for classical
list scheduling in Section 2. Then, we present the principle of the analysis in Section 3 and
we apply this analysis to unit independent tasks in Section 4. Section 5 discusses variations
on the unit tasks model: improvements on the potential function and cooperation among
thieves. We extend the analysis to weighted independent tasks in Section 6 and to tasks
with dependencies in Section 7. Finally, we present and analyze simulation experiments in
Section 8.

2. Model and notations 

2.1. Platform and workload characteristics. We consider a parallel platform composed
of $m$ identical processors and a workload of $n$ tasks with processing times $p_j$. The total
work of the computation is denoted by $W = \sum_{j=1}^{n} p_j$. The tasks can be independent or
constrained by a directed acyclic graph (DAG) of precedences. In this case, we denote
by $D$ the critical path of the DAG (its depth). We consider an online model where the
processing times and precedences are discovered during the computation. More precisely,
we learn the processing time of a task when its execution terminates and we discover
new tasks in the DAG only when all their precedences have been satisfied. The problem is
to study the maximum completion time (makespan, denoted by $C_{\max}$) taking into account
the scheduling cost.

Figure 1. A typical execution of $W = 2000$ unit independent tasks on
$m = 25$ processors using distributed list scheduling. Grey area represents
idle times due to steal requests.



2.2. Centralized list scheduling. Let us recall briefly the principle of list scheduling as
it was introduced by Graham (1969). The analysis states that the makespan of any list
algorithm is not greater than twice the optimal makespan. One way of proving this bound
is to use a geometric argument on the Gantt chart: $m \cdot C_{\max} = W + S_{idle}$ where the last
term is the surface of idle periods (represented in grey in Figure 1).

Depending on the scheduling problem (with or without precedence constraints, unit
tasks or not), there are several ways to compute $S_{idle}$. With precedence constraints, $S_{idle} \le
(m - 1) \cdot D$. For independent tasks, the result can be written as $S_{idle} \le (m - 1) \cdot p_{\max}$
where $p_{\max}$ is the maximum of the processing times. For unit independent tasks, it is
straightforward to obtain an optimal algorithm where the load is evenly balanced. Thus
$S_{idle} \le m - 1$, i.e. at most one slot of the schedule contains idle times.
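
The geometric argument can be checked on a small instance. The following is a minimal sketch of centralized greedy list scheduling for independent tasks, given under stated assumptions: the function name `list_schedule` and the example processing times are hypothetical and do not come from the paper. Each task of the list is assigned to the processor that becomes free first, and the idle surface is recovered from $m \cdot C_{\max} = W + S_{idle}$.

```python
import heapq


def list_schedule(processing_times, m):
    """Greedy list scheduling of independent tasks on m identical processors.

    Tasks are taken in list order and started on the processor that becomes
    free first. Returns (makespan, idle_surface).
    """
    free_at = [0] * m              # time at which each processor becomes free
    heapq.heapify(free_at)
    for p in processing_times:
        start = heapq.heappop(free_at)      # earliest available processor
        heapq.heappush(free_at, start + p)
    makespan = max(free_at)
    # Geometric argument on the Gantt chart: m * Cmax = W + S_idle.
    idle_surface = m * makespan - sum(processing_times)
    return makespan, idle_surface


if __name__ == "__main__":
    tasks = [3, 1, 4, 1, 5, 9, 2, 6]    # hypothetical processing times
    cmax, idle = list_schedule(tasks, 3)
    print(cmax, idle, (3 - 1) * max(tasks))
```

On this instance the idle surface stays below $(m - 1) \cdot p_{\max}$, as the bound for independent tasks predicts.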

2.3. Decentralized list scheduling. When the list of ready tasks is distributed among the
processors, the analysis is more complex even in the elementary case of unit independent
tasks. In this case, the extra $S_{idle}$ term is induced by the distributed nature of the problem.
Processors can be idle even when ready tasks are available. Fig. 1 is an example of a schedule
obtained using distributed list scheduling which shows the complicated repartition of
the idle times $S_{idle}$.

2.4. Model of the distributed list. We now describe precisely the behavior of the distributed
list. Each processor $i$ maintains its own local queue $Q_i$ of tasks ready to execute.
At the beginning of the execution, ready tasks can be arbitrarily spread among the queues.
While $Q_i$ is not empty, processor $i$ picks a task and executes it. When this task has been
executed, it is removed from the queue and another one starts being processed. When
$Q_i$ is empty, processor $i$ sends a steal request to another processor $k$ chosen uniformly at
random. If $Q_k$ is empty or contains only one task (currently executed by processor $k$),
then the request fails and processor $i$ will send a new request at the next time step. If $Q_k$
contains more than one task, then $i$ is given half of the tasks and it will restart a normal
execution at the next step. To model the contention on the queues, no more than one steal
request per processor can succeed in the same time slot. If several requests target the same
processor, a random one succeeds and all the others fail. This assumption will be relaxed
in Section 5.2. A steal request is said to be successful if the target queue contains more than one
task and the request is not aborted due to contention. In all the other cases, the steal request
is said to be unsuccessful.
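
The protocol above can be sketched as a small simulation for unit independent tasks. This is a minimal illustration under stated assumptions, not the simulator used in the paper: the function name `simulate_dls`, the choice of placing all tasks in queue 0 initially, and the rounding rule for the split (the victim keeps the larger half of what remains after executing its current task) are hypothetical choices.

```python
import random


def simulate_dls(W, m, seed=0):
    """Simulate distributed list scheduling of W unit tasks on m >= 2 processors.

    Returns (makespan, number_of_steal_requests).
    Assumption: all tasks start in queue 0 (the worst case for the potential).
    """
    rng = random.Random(seed)
    w = [0] * m
    w[0] = W
    steps = 0
    steals = 0
    while sum(w) > 0:
        # Idle processors each target another processor uniformly at random.
        requests = {}
        for i in range(m):
            if w[i] == 0:
                k = rng.choice([p for p in range(m) if p != i])
                requests.setdefault(k, []).append(i)
                steals += 1
        new_w = list(w)
        # Active processors execute one unit task during this time step.
        for i in range(m):
            if w[i] > 0:
                new_w[i] = w[i] - 1
        # At most one request per victim succeeds, and only if the victim
        # holds more than one task (one of them is being executed).
        for k, contenders in requests.items():
            if w[k] > 1:
                thief = rng.choice(contenders)
                rest = w[k] - 1
                new_w[k] = (rest + 1) // 2   # assumed: victim keeps the larger half
                new_w[thief] = rest // 2
        w = new_w
        steps += 1
    return steps, steals


if __name__ == "__main__":
    cmax, r = simulate_dls(2000, 25, seed=1)
    print(cmax, r)
```

Since every processor either executes one task or sends one steal request at each time step, the two returned values satisfy $m \cdot C_{\max} = W + R$ by construction.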

This is a high-level model of a distributed list but it accurately models the case of
independent tasks and the WS algorithm of Arora et al (2001). We justify here some
choices of this model. There is no explicit communication cost since WS algorithms most
often target shared memory platforms. In addition, a steal request is done in constant time
independently of the amount of tasks transferred. This assumption is not restrictive as the
description of a large number of tasks can be very short. In the case of independent tasks,
a whole subpart of an array of tasks can be represented in a compact way by the range of
the corresponding indices, each cell containing the effective description of a task (an STL
transform in Traore et al (2008)). For more general cases with precedence constraints, it is
usually enough to transfer a task which represents a part of the DAG. More details on the
DAG model are provided in Section 7. Finally, there is no contention between a processor
executing a task from its own queue and a processor stealing from the same queue. Indeed, one
can use queue data structures allowing these two operations to happen concurrently (Frigo
et al 1998).

2.5. Properties of the work. At time $t$, let $w_i(t)$ represent the amount of work in queue
$Q_i$ (cf. Fig. 2). $w_i(t)$ may be defined as the sum of processing times of all tasks in $Q_i$ as in
Section 4 but can differ as in Sections 6 and 7. In all cases, the definition of $w_i(t)$ satisfies
the following properties.

(1) When $w_i(t) > 0$, processor $i$ is active and executes some work: $w_i(t+1) \le w_i(t)$.

(2) When $w_i(t) = 0$, processor $i$ is idle and sends a steal request to a random processor
$k$. If the steal request is successful, a certain amount of work is transferred from
processor $k$ to processor $i$ and we have $\max\{w_i(t+1), w_k(t+1)\} < w_k(t)$.

(3) The execution terminates when there is no more work in the system, i.e. $\forall i$, $w_i(t) = 0$.



We also denote the total amount of work on all processors by $w(t) = \sum_{i=1}^{m} w_i(t)$ and
the number of processors sending steal requests by $r_t \in [0, m - 1]$. Notice that when
$r_t = m$, all queues are empty and thus the execution is complete.

Figure 2. Evolution of the workload of the different processors during
a time step: (a) workload at time $t$; (b) workload at time step $t + 1$.
At time $t$, processors 2 and 3 are idle and they both choose
processor 1 to steal from. At time $t + 1$, only processor 2 succeeds in
stealing some of the work of processor 1. The work is split between the
two processors. Processors 1 and 4 both execute some work during this
time step (represented by a shaded zone).



3. Principle of the analysis and main theorem 

This section presents the principle of the analysis. The main result is Theorem 1, which
gives bounds on the expectation of the number of steal requests done by the schedule as well as on the
probability that the number of work requests exceeds this bound. As a processor is either
executing or requesting work, the number of work requests plus the total amount of tasks to
be executed is equal to $m \cdot C_{\max}$, where $C_{\max}$ is the total completion time. The makespan
can be derived from the total number of work requests $R$:

\[ C_{\max} = \frac{W}{m} + \frac{R}{m}. \tag{1} \]
The main idea of our analysis is to study the decrease of a potential $\Phi_t$. The potential $\Phi_t$
depends on the load on all processors at time $t$, $w(t)$. The precise definition of $\Phi_t$ varies
depending on the scenario (see Sections 4 to 7). For example, the potential function used
in Section 4 is $\Phi_t = \sum_{i=1}^{m} (w_i(t) - w(t)/m)^2$. For each scenario, we will prove that the
diminution of the potential during one time step depends on the number of steal requests,
$r_t$. More precisely, we will show that there exists a function $h : \{0 \ldots m\} \to [0; 1]$ such
that the average value of the potential at time $t + 1$ is less than $h(r_t) \cdot \Phi_t$.

Using the expected diminution of the potential, we derive a bound on the number of steal
requests until $\Phi_t$ becomes less than one, $R = \sum_{s=0}^{\tau-1} r_s$, where $\tau$ denotes the first time that
$\Phi_t$ is less than 1. If all $r_t$ were equal to $r$ and the potential decrease was deterministic, the
number of time steps before $\Phi_t < 1$ would be $\lceil \log_2 \Phi_0 / (-\log_2 h(r)) \rceil$ and the number of steal
requests would be $r / (-\log_2 h(r)) \cdot \log_2 \Phi_0$. As $r$ can vary between 1 and $m$, the worst case for
this bound is $m\lambda \cdot \log_2 \Phi_0$, where $m\lambda = \max_{1 \le r \le m} r / (-\log_2 h(r))$.

The next theorem shows that the number of steal requests is indeed bounded by $m\lambda \log_2 \Phi_0$
plus an additive term due to the stochastic nature of $\Phi_t$. The fact that $\lambda$ corresponds to
the worst choice of $r_t$ at each time step makes the bound looser than the real constant.
However, we show in Section 8 that the gap between the obtained bound and the values
obtained by simulation is small. Moreover, the computation of the constant $\lambda$ is simple and
makes this analysis applicable in several scenarios, such as the ones presented in Sections
4 to 7.

In the following theorem and its proof, we use the following notations. $\mathcal{F}_t$ denotes
the knowledge of the system up to time $t$ (namely, the filtration associated to the process
$w(t)$). For a random variable $X$, the conditional expectation of $X$ knowing $\mathcal{F}_t$ is denoted
$\mathbb{E}[X \mid \mathcal{F}_t]$. Finally, the notation $\mathbf{1}_A$ denotes the random variable equal to 1 if the event
$A$ is true and 0 otherwise. In particular, this means that the probability of an event $A$ is
$\mathbb{P}\{A\} = \mathbb{E}[\mathbf{1}_A]$.

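Theorem 1. Assume that there exists a function $h : \{0 \ldots m\} \to [0, 1]$ such that the
potential satisfies:

\[ \mathbb{E}[\Phi_{t+1} \mid \mathcal{F}_t] \le h(r_t) \cdot \Phi_t. \]

Let $\Phi_0$ denote the potential at time 0 and $\lambda$ be defined as:

\[ \lambda \stackrel{\mathrm{def}}{=} \max_{1 \le r \le m} \frac{r}{-m \log_2(h(r))}. \]

Let $\tau$ be the first time that $\Phi_t$ is less than 1, $\tau = \min\{t : \Phi_t < 1\}$. The number of steal
requests until $\tau$, $R = \sum_{s=0}^{\tau-1} r_s$, satisfies:

(i) $\mathbb{P}\{R \ge m \cdot \lambda \cdot \log_2 \Phi_0 + m + u\} \le 2^{-u/(m \cdot \lambda)}$,
(ii) $\mathbb{E}[R] \le m \cdot \lambda \cdot \log_2 \Phi_0 + m\left(1 + \frac{\lambda}{\ln 2}\right)$.

Proof. For two time steps $t \le T$, we call $R_t^T$ the number of steal requests between $t$ and
$T$:

\[ R_t^T \stackrel{\mathrm{def}}{=} \sum_{s=t}^{\min\{\tau, T\} - 1} r_s. \]

The number of steal requests until $\Phi_t < 1$ is $R = \sum_{s=0}^{\tau-1} r_s = \lim_{T \to \infty} R_0^T$.
We show by a backward induction on $t$ that for all $t \le T$:

\[ \text{if } \Phi_t \ge 1, \text{ then } \forall u \in \mathbb{R}: \quad
\mathbb{E}\left[ \mathbf{1}_{R_t^T \ge m \lambda \log_2 \Phi_t + m + u} \mid \mathcal{F}_t \right] \le 2^{-u/(m \cdot \lambda)}. \tag{2} \]

For $t = T$, $R_T^T = 0$ and $\mathbb{E}\left[ \mathbf{1}_{R_T^T \ge m \lambda \log_2 \Phi_T + m + u} \mid \mathcal{F}_T \right] = 0$. Thus, (2) is true for $t = T$.
Assume that (2) holds for some $t + 1 \le T$ and suppose that $\Phi_t \ge 1$. Let $u > 0$ (if $u \le 0$,
the right-hand side of (2) is at least 1 and the inequality is trivial). Since $R_t^T = r_t + R_{t+1}^T$, the probability
$\mathbb{P}\{R_t^T \ge m \lambda \log_2 \Phi_t + m + u \mid \mathcal{F}_t\}$ is equal to

\begin{align}
\mathbb{E}\left[ \mathbf{1}_{R_t^T \ge m \lambda \log_2 \Phi_t + m + u} \mid \mathcal{F}_t \right]
 &= \mathbb{E}\left[ \mathbf{1}_{r_t + R_{t+1}^T \ge m \lambda \log_2 \Phi_t + m + u} \mid \mathcal{F}_t \right] \tag{3} \\
 &= \mathbb{E}\left[ \mathbf{1}_{r_t + R_{t+1}^T \ge m \lambda \log_2 \Phi_t + m + u} \, \mathbf{1}_{\Phi_{t+1} \ge 1} \mid \mathcal{F}_t \right] \tag{4} \\
 &\quad + \mathbb{E}\left[ \mathbf{1}_{r_t + R_{t+1}^T \ge m \lambda \log_2 \Phi_t + m + u} \, \mathbf{1}_{\Phi_{t+1} < 1} \mid \mathcal{F}_t \right]. \tag{5}
\end{align}

If $\Phi_{t+1} < 1$, then $R_{t+1}^T = 0$. Since $m \ge r_t$ and $\Phi_t \ge 1$, $m \lambda \log_2 \Phi_t + m + u - r_t > 0$.
This shows that the term of Equation (5) is equal to zero. (4) is the probability that $R_{t+1}^T$
is greater than

\[ m \lambda \log_2 \Phi_t + m + u - r_t = m \lambda \log_2 \Phi_{t+1} + m + \left( u - r_t - m \lambda \log_2(\Phi_{t+1}/\Phi_t) \right). \]

Therefore, using the induction hypothesis, (4) is equal to

\begin{align*}
\mathbb{E}\left[ \mathbf{1}_{R_{t+1}^T \ge m \lambda \log_2 \Phi_t + m + u - r_t} \, \mathbf{1}_{\Phi_{t+1} \ge 1} \mid \mathcal{F}_t \right]
 &\le \mathbb{E}\left[ 2^{-\frac{u - r_t - m \lambda \log_2(\Phi_{t+1}/\Phi_t)}{m \lambda}} \mid \mathcal{F}_t \right] \\
 &= 2^{-\frac{u - r_t}{m \lambda}} \cdot \mathbb{E}\left[ \frac{\Phi_{t+1}}{\Phi_t} \mid \mathcal{F}_t \right] \\
 &\le 2^{-\frac{u - r_t}{m \lambda}} \cdot h(r_t) \\
 &= 2^{-\frac{u}{m \lambda}} \cdot 2^{\frac{r_t}{m \lambda} + \log_2(h(r_t))},
\end{align*}

where at the first line we used both the fact that for a random variable $X$, $\mathbb{E}[X \mid \mathcal{F}_t] =
\mathbb{E}[\mathbb{E}[X \mid \mathcal{F}_{t+1}] \mid \mathcal{F}_t]$ and the induction hypothesis.

If $r_t = 0$, $2^{r_t/(m\lambda) + \log_2(h(r_t))} = h(r_t) \le 1$. Otherwise, by definition of $\lambda = \max_{1 \le r \le m} r /
(-m \log_2(h(r)))$, $r_t/(m\lambda) + \log_2(h(r_t)) \le 0$ and $2^{r_t/(m\lambda) + \log_2(h(r_t))} \le 1$. This shows that (2) holds
for $t$. Therefore, by induction on $t$, (2) holds for $t = 0$: for all $u \ge 0$:

\[ \mathbb{E}\left[ \mathbf{1}_{R_0^T \ge m \lambda \log_2 \Phi_0 + m + u} \mid \mathcal{F}_0 \right] \le 2^{-u/(m \cdot \lambda)}. \]

As $r_t \ge 0$, the sequence $(R_0^T)_T$ is increasing and converges to $R$. Therefore, the sequence
$\mathbf{1}_{R_0^T \ge m \lambda \log_2 \Phi_0 + m + u}$ is increasing in $T$ and converges to $\mathbf{1}_{R \ge m \lambda \log_2 \Phi_0 + m + u}$. Thus, by
Lebesgue's monotone convergence theorem, this shows that

\[ \mathbb{P}\{R \ge m \lambda \log_2 \Phi_0 + m + u\} = \lim_{T \to \infty} \mathbb{E}\left[ \mathbf{1}_{R_0^T \ge m \lambda \log_2 \Phi_0 + m + u} \right] \le 2^{-u/(m \cdot \lambda)}. \]

The second part of the theorem (ii) is a direct consequence of (i). Indeed,

\begin{align*}
\mathbb{E}[R] &= \int_0^\infty \mathbb{P}\{R \ge u\} \, du \\
 &\le m \lambda \log_2 \Phi_0 + m + \int_0^\infty \mathbb{P}\{R \ge m \lambda \log_2 \Phi_0 + m + u\} \, du \\
 &\le m \lambda \log_2 \Phi_0 + m + \int_0^\infty 2^{-u/(m \lambda)} \, du \\
 &\le m \lambda \log_2 \Phi_0 + m\left(1 + \frac{\lambda}{\ln 2}\right). \qquad \square
\end{align*}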

4. Unit independent tasks 



We apply the analysis presented in the previous section to the case of independent unit
tasks. In this case, each processor $i$ maintains a local queue $Q_i$ of tasks to execute. At
every time slot, if the local queue $Q_i$ is not empty, processor $i$ picks a task and executes
it. When $Q_i$ is empty, processor $i$ sends a steal request to a random processor $j$. If $Q_j$ is
empty or contains only one task (currently executed by processor $j$), then the request fails
and processor $i$ will have to send a new request at the next slot. If $Q_j$ contains more than
one task, then $i$ is given half of the tasks (after the task executed at time $t$ by processor
$j$ has been removed from $Q_j$). The amount of work on processor $i$ at time $t$, $w_i(t)$, is the
number of tasks in $Q_i(t)$. At the beginning of the execution, $w(0) = W$ and tasks can be
arbitrarily spread among the queues.

4.1. Potential function and expected decrease. Applying the method presented in Section 3,
the first step of the analysis is to define the potential function and compute the
potential decrease when a steal occurs. For this example, $\Phi(t)$ is defined by:

\[ \Phi(t) = \sum_{i=1}^{m} \left( w_i(t) - \frac{w(t)}{m} \right)^2 = \sum_{i=1}^{m} w_i(t)^2 - \frac{w(t)^2}{m}. \]

This potential represents the load unbalance in the system. If all queues have the same load
$w_i(t) = w(t)/m$, then $\Phi(t) = 0$. $\Phi(t) < 1$ implies that there is at most one processor with
at most one more task than the others. In that case, there will be no steal until there is just
one processor with 1 task and all others idle. Moreover, the potential function is maximal
when all the work is concentrated on a single queue. That is $\Phi(t) \le w(t)^2 - w(t)^2/m \le
(1 - 1/m) w^2(t)$.
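
As a quick sanity check of these properties, the potential can be evaluated on a few load vectors. This is only an illustrative sketch: the function name `potential` and the example vectors are hypothetical, and exact rational arithmetic is used to avoid rounding issues.

```python
from fractions import Fraction


def potential(w):
    """Potential of Section 4: sum of squared deviations from the mean load."""
    m = len(w)
    mean = Fraction(sum(w), m)
    return sum((Fraction(x) - mean) ** 2 for x in w)


def potential_expanded(w):
    """Equivalent form: sum of squares minus (total work)^2 / m."""
    m = len(w)
    return sum(Fraction(x) ** 2 for x in w) - Fraction(sum(w) ** 2, m)


if __name__ == "__main__":
    print(potential([5, 5, 5, 5]))      # balanced queues
    print(potential([20, 0, 0, 0]))     # all work in a single queue
    print(potential([7, 6, 4, 3]))
```

The balanced vector gives a zero potential, and the vector with all the work in one queue reaches the maximal value $(1 - 1/m) w(t)^2$.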

Three events contribute to a variation of potential: successful steals, task executions and
the decrease of $w^2(t)/m$.

(1) If the queue $i$ has $w_i(t) \ge 1$ tasks and it receives one or more steal requests, it
chooses a processor $j$ among the thieves. At time $t + 1$, $i$ has executed one task
and the rest of the work is split between $i$ and $j$. Therefore,
$w_i(t+1) = \lceil (w_i(t) - 1)/2 \rceil$ and $w_j(t+1) = \lfloor (w_i(t) - 1)/2 \rfloor$.
Thus, we have:

\[ w_i(t+1)^2 + w_j(t+1)^2 = \left\lceil \frac{w_i(t) - 1}{2} \right\rceil^2 + \left\lfloor \frac{w_i(t) - 1}{2} \right\rfloor^2 \le \frac{w_i(t)^2}{2} - w_i(t) + 1. \tag{6} \]

Therefore, this generates a difference of potential of

\[ \delta_i(t) \ge \frac{w_i(t)^2}{2} + w_i(t) - 1. \]

(2) If $i$ has $w_i(t) \ge 1$ tasks and receives zero steal requests, its potential goes from
$w_i(t)^2$ to $(w_i(t) - 1)^2$, generating a potential decrease of $2 w_i(t) - 1$.

(3) As there are $m - r_t$ active processors, $(\sum_{i=1}^{m} w_i(t))^2/m$ goes from $w(t)^2/m$
to $w(t+1)^2/m = (w(t) - m + r_t)^2/m$, generating a potential increase of
$2(m - r_t) w(t)/m - (m - r_t)^2/m$.
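
Inequality (6) and the resulting lower bound on the potential decrease of a successful steal can be checked exhaustively for small queue sizes. The sketch below is illustrative only: the helper names are hypothetical, and the rounding of the split (victim keeps the ceiling, thief the floor) is an assumption, although the bound does not depend on which side gets the larger half.

```python
def split_after_execution(w_i):
    """Victim executes one task, then the remaining w_i - 1 tasks are halved.

    Returns (tasks kept by the victim, tasks given to the thief).
    """
    rest = w_i - 1
    return (rest + 1) // 2, rest // 2


def steal_decrease(w_i):
    """Decrease of w_i^2 + w_j^2 caused by a successful steal on queue i."""
    kept, stolen = split_after_execution(w_i)
    return w_i ** 2 - (kept ** 2 + stolen ** 2)


def check_inequality_6(max_w):
    """Check (6) for all 1 <= w_i <= max_w, using integers only (times 2)."""
    for w_i in range(1, max_w + 1):
        kept, stolen = split_after_execution(w_i)
        if 2 * (kept ** 2 + stolen ** 2) > w_i ** 2 - 2 * w_i + 2:
            return False
    return True


if __name__ == "__main__":
    print(check_inequality_6(1000))
    print(steal_decrease(4), steal_decrease(5))
```

The bound is tight when $w_i(t)$ is even: for $w_i(t) = 4$ the decrease equals $w_i(t)^2/2 + w_i(t) - 1 = 11$.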

Recall that at time $t$, there are $r_t$ processors that send steal requests. A processor $i$ receives
zero steal requests if the $r_t$ thieves choose another processor. Each of these events is
independent and happens with probability $(m - 2)/(m - 1)$. Therefore, the probability for
the processor to receive one or more steal requests is $q(r_t)$ where

\[ q(r_t) = 1 - \left( 1 - \frac{1}{m - 1} \right)^{r_t}. \]
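
The expression of $q(r_t)$ can be compared with a direct simulation of the thieves' choices. This is a hedged sketch: the function names and the parameters ($m = 10$, $r = 5$, number of trials) are hypothetical and chosen only for illustration.

```python
import random


def q(r, m):
    """Probability that an active processor receives at least one of r requests."""
    return 1.0 - (1.0 - 1.0 / (m - 1)) ** r


def estimate_q(r, m, trials, seed=0):
    """Monte Carlo estimate: each thief picks one of its m - 1 possible victims.

    By symmetry, victim index 0 plays the role of the observed active processor.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        if any(rng.randrange(m - 1) == 0 for _ in range(r)):
            hits += 1
    return hits / trials


if __name__ == "__main__":
    print(q(5, 10), estimate_q(5, 10, 20000, seed=1))
```

The same function also illustrates the bound $q(r) \le r/(m-1)$ used later in the derivation.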



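If $\Phi_t = \Phi$ and $r_t = r$, by summing the expected decrease $\delta_i$ on each active processor, the
expected potential decrease is greater than:

\begin{align*}
&\sum_{i / w_i(t) > 0} \left[ q(r) \left( \frac{w_i(t)^2}{2} + w_i(t) - 1 \right) + (1 - q(r)) (2 w_i(t) - 1) \right]
 - 2 w(t) \frac{m - r}{m} + \frac{(m - r)^2}{m} \\
&\quad = \frac{q(r)}{2} \sum_{i / w_i(t) > 0} w_i(t)^2 - q(r) w(t) + 2 w(t) - (m - r) - 2 w(t) \frac{m - r}{m} + \frac{(m - r)^2}{m}.
\end{align*}

Using that $2 w(t) - 2 w(t) \frac{m - r}{m} = 2 w(t) \frac{r}{m}$, that $-(m - r) + \frac{(m - r)^2}{m} = -\frac{r}{m}(m - r)$ and
that $\sum_i w_i(t)^2 = \Phi + w(t)^2/m$, this equals:

\begin{align*}
&\frac{q(r)}{2} \Phi + \frac{q(r)}{2} \frac{w(t)^2}{m} - q(r) w(t) + 2 w(t) \frac{r}{m} - \frac{r}{m}(m - r) \\
&\quad = \frac{q(r)}{2} \Phi + \frac{q(r) w(t)^2}{2m} - q(r) w(t) + \frac{r}{m} (2 w(t) - m + r) \\
&\quad = \frac{q(r)}{2} \Phi + \frac{q(r) w(t)}{2} \left( \frac{w(t)}{m} - 2 + \frac{2r}{m q(r)} \right) + \frac{r}{m} (w(t) - m + r).
\end{align*}

By concavity of $x \mapsto (1 - (1 - x)^r)$, $(1 - (1 - x)^r) \le r \cdot x$. This shows that $q(r) =
1 - (1 - \frac{1}{m-1})^r \le r/(m - 1)$. Thus, $r/q(r) \ge m - 1$. Moreover, as $m - r$ is the number
of active processors, $w(t) \ge m - r$ (each active processor has at least one task). This shows that the
expected decrease of potential is greater than:

\[ \frac{q(r)}{2} \Phi + \frac{q(r) w(t)}{2} \left( \frac{w(t)}{m} - 2 + \frac{2(m - 1)}{m} \right)
 = \frac{q(r)}{2} \Phi + \frac{q(r) w(t)}{2m} (w(t) - 2). \]

If $w(t) \ge 2$, then the expected decrease of potential is greater than $q(r_t) \Phi_t/2$. If $w(t) < 2$,
this means that $w(t) = 1$ and $w(t+1) = 0$ and therefore $\Phi_{t+1} = 0$. Thus, for all $t$:

\[ \mathbb{E}[\Phi_{t+1} \mid \mathcal{F}_t] \le \left( 1 - \frac{q(r_t)}{2} \right) \Phi_t. \tag{7} \]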

4.2. Bound on the makespan. Using Theorem 1 of the previous section, we can solve
equation (7) and conclude the analysis.

Theorem 2. Let $C_{\max}$ be the makespan of $W = n$ unit independent tasks scheduled by
DLS and $\Phi_0 = \sum_i (w_i - W/m)^2$ the potential when the schedule starts. Then:

(i) \[ \mathbb{E}[C_{\max}] \le \frac{W}{m} + \frac{1}{1 - \log_2(1 + \frac{1}{e})} \cdot \left( \log_2 \Phi_0 + \frac{1}{\ln 2} \right) + 1, \]

(ii) \[ \mathbb{P}\left\{ C_{\max} \ge \frac{W}{m} + \frac{1}{1 - \log_2(1 + \frac{1}{e})} \cdot \left( \log_2 \Phi_0 + \log_2 \frac{1}{\epsilon} \right) + 1 \right\} \le \epsilon. \]

In particular:

(iii) \[ \mathbb{E}[C_{\max}] \le \frac{W}{m} + \frac{2}{1 - \log_2(1 + \frac{1}{e})} \cdot \left( \log_2 W + \frac{1}{2 \ln 2} \right) + 1. \]

These bounds are optimal up to a constant factor in $\log_2 W$.

Proof. Equation (7) shows that $\mathbb{E}[\Phi_{t+1} \mid \mathcal{F}_t] \le h(r_t) \Phi_t$ with $h(r) = 1 - q(r)/2$. Defining
$\Phi'_t = \Phi_t/(1 - 1/(m - 1))$, the potential function $\Phi'_t$ also satisfies (7). Therefore, $\Phi'_t$
satisfies the conditions of Theorem 1. This shows that the number of work requests $R$ until
$\Phi'_t < 1$ satisfies

\[ \mathbb{E}[R] \le m \cdot \lambda \log_2(\Phi_0) + m\left(1 + \frac{\lambda}{\ln 2}\right), \]

with $\lambda = \max_{1 \le r \le m-1} r/(-m \log_2 h(r))$. One can show that $r/(-m \log_2 h(r))$ is
increasing in $r$. Thus its maximum is attained for $r = m - 1$. Since $(1 - 1/(m-1))^{m-1} \le 1/e$,
this shows that $\lambda \le 1/(1 - \log_2(1 + 1/e))$.

The minimal non-zero value of $\Phi_t$ is reached when one processor has one task and the others
have none. In that case, $\Phi_t = 1 - 1/m \ge 1 - 1/(m - 1)$. Therefore, when $\Phi'_t < 1$, this means that $\Phi_t = 0$
and the schedule is finished.

As pointed out in Equation (1), at each time step of the schedule, a processor is either
computing one task or stealing work. Thus, the number of steal requests plus the number
of tasks to be executed is equal to $m \cdot C_{\max}$, i.e. $m \cdot C_{\max} = W + R$. This shows that

\[ \mathbb{E}[C_{\max}] \le \frac{W}{m} + \frac{1}{1 - \log_2(1 + \frac{1}{e})} \cdot \left( \log_2 \Phi_0 + \frac{1}{\ln 2} \right) + 1. \]

This concludes the proof of (i). The proof of (i) applies mutatis mutandis to prove
the bound in probability (ii) using Theorem 1 (i).

We now give a lower bound for this problem. Consider $W = 2^{k+1}$ tasks and $m = 2^k$
processors, all the tasks being on the same processor at the beginning. In the best case, all
steal requests target processors with highest loads. In this case the makespan is $C_{\max} =
k + 2$: the first $k = \log_2 m$ steps for each processor to get some work; one step where all
processors are active; and one last step where only one processor is active. In that case,
$C_{\max} \ge \frac{W}{m} + \log_2 W - 1$. $\square$

This theorem shows that the factor in front of $\log_2 W$ lies between 1 and $2/(1 - \log_2(1 +
1/e)) < 3.65$. Simulations reported in Section 8 seem to indicate that the factor of $\log_2 W$
is slightly less than 3.65. This shows that the constants obtained by our analysis are sharp.
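
The constant can be evaluated numerically from the definition of $\lambda$ in Theorem 1 with $h(r) = 1 - q(r)/2$. The following is a minimal sketch, with hypothetical function names, that computes $\lambda$ for a given $m$, reports where the maximum is reached, and compares $2\lambda$ with the limit value $2/(1 - \log_2(1 + 1/e))$.

```python
import math


def q(r, m):
    """Probability that an active processor receives at least one of r requests."""
    return 1.0 - (1.0 - 1.0 / (m - 1)) ** r


def lambda_unit_tasks(m):
    """Return (lambda, argmax r) for h(r) = 1 - q(r)/2 and 1 <= r <= m - 1."""
    best_value, best_r = 0.0, None
    for r in range(1, m):
        h = 1.0 - q(r, m) / 2.0
        value = r / (-m * math.log2(h))
        if value > best_value:
            best_value, best_r = value, r
    return best_value, best_r


# Limit of lambda when m grows, as used in Theorem 2.
LAMBDA_LIMIT = 1.0 / (1.0 - math.log2(1.0 + 1.0 / math.e))

if __name__ == "__main__":
    for m in (10, 100, 1000):
        lam, r_star = lambda_unit_tasks(m)
        print(m, r_star, 2 * lam, 2 * LAMBDA_LIMIT)
```

For each tested $m$ the maximum is reached at $r = m - 1$ and $2\lambda$ stays below 3.65.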

4.3. Influence of the initial repartition of tasks. In the worst case, all tasks are in the
same queue at the beginning of the execution and $\Phi_0 = (1 - 1/m) W^2 \le W^2$. This
leads to a bound on the number of work requests in $3.65\, m \log_2 W$ (see item (iii) of
Theorem 2). However, using bounds in terms of $\Phi_0$, our analysis is able to capture the
difference in the number of work requests if the initial repartition is more balanced. One
can show that a more balanced initial repartition ($\Phi_0 \ll W^2$) leads to fewer steal requests
on average.

Suppose for example that the initial repartition is a balls-and-bins assignment: each
task is assigned to a processor at random. In this case, the initial number of tasks in queue
$i$, $w_i(0)$, follows a binomial distribution $\mathcal{B}(W, 1/m)$. The expected value of $\Phi_0$ is:

\[ \mathbb{E}[\Phi_0] = \sum_i \mathbb{E}\left[ w_i(0)^2 \right] - \frac{W^2}{m} = \sum_i \mathrm{Var}[w_i(0)] = W \left( 1 - \frac{1}{m} \right). \]

Since the number of work requests is proportional to $\log_2 \Phi_0$, this initial repartition of
tasks reduces the number of steal requests by a factor of 2 on average. This leads to a
better bound on the makespan in $W/m + 1.83 \log_2 W + 3.63$.
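
The expected initial potential under a random assignment can be estimated empirically. The sketch below is an illustration under stated assumptions: the function names, the parameters ($W = 200$, $m = 10$) and the number of trials are hypothetical, and the estimate is compared with $W(1 - 1/m)$ up to sampling noise.

```python
import random


def initial_potential(W, m, rng):
    """Potential after assigning each of W unit tasks to a uniformly random queue."""
    w = [0] * m
    for _ in range(W):
        w[rng.randrange(m)] += 1
    return sum(x * x for x in w) - W * W / m


def mean_initial_potential(W, m, trials, seed=0):
    """Average of the initial potential over independent random assignments."""
    rng = random.Random(seed)
    return sum(initial_potential(W, m, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    W, m = 200, 10
    print(mean_initial_potential(W, m, 2000, seed=1), W * (1 - 1 / m))
```

The estimate is of order $W$ rather than $W^2$, which is why $\log_2 \Phi_0$ is roughly halved compared with the worst case.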

5. Going further on the unit tasks model 

In this section, we provide two different analyses of the model of unit tasks of the previous
section. We first show how the use of a different potential function $\Phi_t = \sum_i w_i(t)^\nu$
(for some $\nu > 1$) leads to a better bound on the number of work requests. Then we show
how cooperation among thieves leads to a reduction of the bound on the number of work
requests by 12%. The latter is corroborated by our simulation, which shows a decrease in the
number of work requests between 10% and 15%.

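5.1. Improving the analysis by changing the potential function. We consider the same
model of unitary tasks as in Section 4. The potential function of our system is defined as

\[ \Phi_t = \sum_{i=1}^{m} w_i(t)^\nu, \]

where $\nu > 1$ is a constant factor.

When an idle processor steals from a processor with $w_i(t)$ tasks, the potential decreases by

\[ \delta_i = w_i(t)^\nu - \left\lceil \frac{w_i(t) - 1}{2} \right\rceil^\nu - \left\lfloor \frac{w_i(t) - 1}{2} \right\rfloor^\nu
 \ge w_i(t)^\nu - 2 \left( \frac{w_i(t)}{2} \right)^\nu = (1 - 2^{1-\nu}) \, w_i(t)^\nu. \]

This shows that the expected value of the potential at time $t + 1$ is

\[ \mathbb{E}[\Phi_{t+1}] \le \left( 1 - q(r) (1 - 2^{1-\nu}) \right) \cdot \Phi_t, \]

where $q(r)$ is the probability for a processor to receive at least one work request when $r$
processors are stealing, $q(r) = 1 - \left( 1 - \frac{1}{m-1} \right)^r$.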

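Following the analysis of the previous part, and as $\Phi_0 \le W^\nu$, the expected makespan is
bounded by:

$$\frac{W}{m} + \lambda(\nu)\cdot\log_2 \Phi_0 + 1 + \frac{\lambda(\nu)}{\ln 2} \le \frac{W}{m} + \nu\lambda(\nu)\cdot\log_2 W + 1 + \frac{\lambda(\nu)}{\ln 2},$$

where $\lambda(\nu)$ is a constant depending on $\nu$ equal to:

$$\lambda(\nu) = \max_{1 \le r \le m-1} \frac{r}{-m \cdot \log_2\left(1 - q(r)\left(1 - 2^{1-\nu}\right)\right)}. \tag{8}$$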

As for $\nu = 2$ in Section 4, it can be shown that the maximum of Equation (8) is attained for
$r = m - 1$.

The constant factor in front of $\log W$ is $\nu\lambda(\nu)$. Numerically, the minimum of $\nu\lambda(\nu)$ is
attained for $\nu \approx 2.94$ and is less than 3.24.
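
The numerical minimization can be reproduced with a few lines of Python. This is only a sketch, assuming $q(r) = 1 - (1 - 1/(m-1))^r$ and $\lambda(\nu) = \max_{1 \le r \le m-1} r / \left(-m \log_2(1 - q(r)(1 - 2^{1-\nu}))\right)$; the function names, the grid search and the choice $m = 1000$ are illustrative and not taken from the paper.

```python
import math

def q(r, m):
    """Probability that a given processor receives at least one of r steal requests."""
    return 1.0 - (1.0 - 1.0 / (m - 1)) ** r

def lam(nu, m):
    """lambda(nu): maximum over the number r of stealing processors (1 <= r <= m - 1)."""
    shrink = 1.0 - 2.0 ** (1.0 - nu)
    return max(r / (-m * math.log2(1.0 - q(r, m) * shrink)) for r in range(1, m))

def best_nu(m, lo=2.0, hi=4.0, steps=200):
    """Grid search for the exponent nu that minimizes nu * lambda(nu)."""
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return min(grid, key=lambda nu: nu * lam(nu, m))

if __name__ == "__main__":
    m = 1000
    nu = best_nu(m)
    print("nu = 2     :", 2 * lam(2, m))      # constant of Section 4
    print("best nu    :", nu)
    print("nu*lambda  :", nu * lam(nu, m))    # improved constant
```

The curve $\nu \mapsto \nu\lambda(\nu)$ is very flat around its minimum, so the exact location of the minimizer depends on the grid and on $m$, while the minimal value is stable.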

Theorem 3. Let $C_{\max}$ be the makespan of $W = n$ unit independent tasks scheduled by DLS.
Then:

$$\mathbb{E}[C_{\max}] \le \frac{W}{m} + 3.24\cdot\left(\log_2 W + \frac{1}{2\ln 2}\right) + 1.$$

In Section 4, we have shown that the makespan was bounded by

$$\frac{W}{m} + \lambda(2)\cdot\left(\log_2 \Phi_0 + \frac{1}{\ln 2}\right) + 1 \le \frac{W}{m} + 3.65\cdot\left(\log_2 W + \frac{1}{2\ln 2}\right) + 1.$$

Theorem 3 improves the constant factor in front of $\log_2 W$. However, we lose the
information on the initial repartition of tasks $\Phi_0$.

5.2. Cooperation among thieves. In this section, we modify the protocol for managing
the distributed list. Previously, when $k > 1$ steal requests were sent to the same processor,
only one of them could be served due to contention on the list. We now allow the $k$
requests to be served in unit time. This model has been implemented in the middleware
Kaapi (Gautier et al 2007). When $k$ steal requests target the same processor, the work
is divided into $k + 1$ pieces. In practice, allowing concurrent thieves increases the cost of
a steal request but we neglect this additional cost here. We assume that the $k$ concurrent
steal requests can be served in unit time. We study the influence of this new protocol on
the number of steal requests in the case of unit independent tasks.
We define the potential of the system at time $t$ to be:

$$\Phi_t = \sum_{i=1}^{m} \left(w_i(t)^\nu - w_i(t)\right).$$

Let us first compute the decrease of the potential when processor $i$ receives $k \ge 1$ steal
requests. If $w_i(t) > 0$, it can be written $w_i(t) = (k + 1)q + b$ with $0 \le b < k + 1$. We
neglect the decrease of potential due to the execution of tasks ($\nu > 1$ implies that execution
of tasks decreases the potential).

After one time step and $k$ steal requests, the work will be divided into $b$ parts with $q + 1$
tasks and $k + 1 - b$ parts with $q$ tasks. $\sum_i w_i(t)$ does not vary during the stealing phase.
Therefore, the difference of potential due to these $k$ work requests is

$$\delta_i^k = \left((k + 1)q + b\right)^\nu - b\,(q + 1)^\nu - (k + 1 - b)\,q^\nu.$$

Let us denote $a = b/(k + 1) \in [0; 1)$ and let $f(x) = (x + a)^\nu + (1 - 2^{1-\nu})(x + a) - (1 - a)x^\nu - a(x + 1)^\nu$.
The first derivative of $f$ is $f'(x) = \nu(x + a)^{\nu-1} + (1 - 2^{1-\nu}) - \nu(1 - a)x^{\nu-1} - \nu a(x + 1)^{\nu-1}$
and the derivative of $f'$ is $f''(x) = \nu(\nu - 1)\left((x + a)^{\nu-2} - (1 - a)x^{\nu-2} - a(x + 1)^{\nu-2}\right)$.
As $\nu \le 3$, the function $x \mapsto x^{\nu-2}$ is concave, which implies that
$f''(x) \ge 0$. Therefore, $f'$ is increasing. Moreover, $f'(0) = \nu(a^{\nu-1} - a) + (1 - 2^{1-\nu}) \ge 0$.
This shows that for all $x$, $f'(x) \ge 0$ and that $f$ is increasing. The value of $f$ at 0 is
$f(0) = a^\nu + (1 - 2^{1-\nu})a - a = a^\nu\left(1 - (2a)^{1-\nu}\right) \ge 0$, which implies that for all $x$,
$f(x) \ge 0$.

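Recall that $w_i(t) = (k + 1)q + b$ and $a = b/(k + 1)$. Using the notation $f$ and the fact
that $(k + 1)^{1-\nu} \le 2^{1-\nu}$, the decrease of potential $\delta_i^k$ can be bounded by

$$\delta_i^k \ge \left(1 - (k + 1)^{1-\nu}\right)\cdot\left(w_i(t)^\nu - w_i(t)\right) + (k + 1)\cdot f(q) \ge \left(1 - (k + 1)^{1-\nu}\right)\cdot\left(w_i(t)^\nu - w_i(t)\right). \tag{9}$$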
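
The cooperative split and the lower bound on the potential decrease can be checked on small instances. The sketch below is illustrative only: the function names are hypothetical, and it assumes that the potential is $\sum_i (w_i^\nu - w_i)$, that the $w$ tasks of the victim are shared among the victim and its $k$ thieves in parts differing by at most one task, and that task execution is neglected.

```python
def cooperative_split(w, k):
    """Split w tasks into k + 1 parts whose sizes differ by at most one."""
    q, b = divmod(w, k + 1)
    return [q + 1] * b + [q] * (k + 1 - b)

def potential_decrease(w, k, nu):
    """Decrease of sum_i (w_i^nu - w_i) when w tasks are shared among k + 1 processors.

    The linear terms cancel because the total number of tasks is unchanged.
    """
    return w ** nu - sum(p ** nu for p in cooperative_split(w, k))

def lower_bound(w, k, nu):
    """Claimed lower bound (1 - (k + 1)^(1 - nu)) * (w^nu - w) on the decrease."""
    return (1 - (k + 1) ** (1 - nu)) * (w ** nu - w)

if __name__ == "__main__":
    nu = 3
    for w, k in [(5, 1), (7, 2), (10, 3)]:
        print(w, k, cooperative_split(w, k),
              potential_decrease(w, k, nu), lower_bound(w, k, nu))
```

For $k = 1$ and $\nu = 3$ the bound is tight whenever $w$ is odd, which is why a small tolerance is advisable when comparing the two quantities in floating point.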

Let $q_k(r)$ be the probability for a processor to receive $k$ work requests when $r$ processors
are stealing. $q_k(r)$ is equal to:

$$q_k(r) = \binom{r}{k}\left(\frac{1}{m-1}\right)^k\left(1 - \frac{1}{m-1}\right)^{r-k}.$$

The expected decrease of the potential caused by the steals on processor $i$ is equal to
$\sum_{k=0}^{r} \delta_i^k\, q_k(r)$. Using equation (9), we can bound the expected potential at time $t + 1$ by

$$\mathbb{E}[\Phi_{t+1} \mid \Phi_t] \le \left(1 - \sum_{k=0}^{r}\left(1 - (k + 1)^{1-\nu}\right)\cdot q_k(r)\right)\cdot\Phi_t.$$

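Theorem 4. The makespan $C_{\max}^{coop}$ of $W = n$ unit independent tasks scheduled with
cooperative work stealing satisfies:

(i) $\mathbb{E}\left[C_{\max}^{coop}\right] \le \frac{W}{m} + 2.88\cdot\log_2 W + 3.4$

(ii) $\mathbb{P}\left\{C_{\max}^{coop} \ge \frac{W}{m} + 2.88\cdot\log_2 W + 2 + \log_2\left(\frac{1}{\epsilon}\right)\right\} \le \epsilon$.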
Proof. The proof is very similar to the one of Theorem 2. Let

$$h(r) = 1 - \sum_{k=0}^{r}\left(1 - (k + 1)^{1-\nu}\right)\cdot q_k(r)$$

and

$$\lambda^{coop}(\nu) = \max_{1 \le r \le m-1} \frac{r}{-m\cdot\log_2 h(r)}.$$

Using Theorem 1, we have:

$$\mathbb{E}\left[C_{\max}^{coop}\right] \le \frac{W}{m} + \nu\lambda^{coop}(\nu)\cdot\log_2 W + \frac{\lambda^{coop}(\nu)}{\ln 2} + 1.$$

In the general case the exact computation of $h(r)$ is intractable. However, by a numerical
computation, one can show that $3\lambda^{coop}(3) < 2.88$.

When $\Phi_t < 1$, we have $\sum_i \left(w_i(t)^\nu - w_i(t)\right) < 1$. This implies that for all processors $i$,
$w_i(t)$ equals 0 or 1. This adds (at most) one step of computation at the end of the schedule.
As $\lambda^{coop}(3)/\ln(2) + 1 + 1 \le 3.4$, we obtain the claimed bound. □

Compared to the situation with no cooperation among thieves, the bound on the number of steal
requests is reduced by a factor $3.24/2.88 \approx 1.12$, i.e. by about 12%. We will see in Section 8 that this is close
to the value obtained by simulation.
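
The constant of the cooperative scheme can be approximated numerically. The sketch below assumes that each of the $r$ thieves picks its victim uniformly among the $m - 1$ other processors, so that the number of requests received by a processor is binomial with parameters $r$ and $1/(m-1)$, and it uses $h(r) = \sum_k (k+1)^{1-\nu} q_k(r)$, which equals $1 - \sum_k (1 - (k+1)^{1-\nu}) q_k(r)$ because the $q_k(r)$ sum to one. Function names, the truncation `kmax` and the value $m = 200$ are illustrative choices.

```python
import math

def h_coop(r, m, nu, kmax=60):
    """h(r) for cooperative stealing: sum_k (k + 1)^(1 - nu) * q_k(r)."""
    p = 1.0 / (m - 1)
    qk = (1.0 - p) ** r                      # q_0(r)
    total = 0.0
    for k in range(min(r, kmax) + 1):
        total += (k + 1) ** (1.0 - nu) * qk
        qk *= (r - k) / (k + 1) * p / (1.0 - p)   # q_{k+1}(r) from q_k(r)
    return total

def h_std(r, m, nu):
    """h(r) for the standard steal: at most one request is served per victim."""
    q = 1.0 - (1.0 - 1.0 / (m - 1)) ** r
    return 1.0 - q * (1.0 - 2.0 ** (1.0 - nu))

def lam(h, nu, m):
    """max over 1 <= r <= m - 1 of r / (-m * log2 h(r))."""
    return max(r / (-m * math.log2(h(r, m, nu))) for r in range(1, m))

if __name__ == "__main__":
    m, nu = 200, 3
    coop, std = nu * lam(h_coop, nu, m), nu * lam(h_std, nu, m)
    print("cooperative:", coop, " standard:", std, " ratio:", std / coop)
```

The truncation at `kmax` only drops binomial terms of negligible mass, since a processor receives about one request on average.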

Remark. The exact computation can be accomplished for $\nu = 2$ (Tchiboukdjian et al
2010) and leads to a constant factor of $2\lambda^{coop}(2) \le -2/\log_2\left(1 - \frac{1}{e}\right) \approx 3.02$.



6. Weighted independent tasks 

In this section, we analyze the number of work requests for weighted independent tasks.
Each task $j$ has a processing time $p_j$ which is unknown. When an idle processor attempts
to steal from a processor, half of the tasks of the victim are transferred from the active processor
to the idle one. A task that is currently executed by a processor cannot be stolen. If the
victim has $2k(+1)$ tasks (plus one for the task that is currently executed), the work is split
in $k(+1)$, $k$. If the victim has $2k + 1(+1)$ tasks, the work is split in $k(+1)$, $k + 1$.

In all this analysis, we consider that the scheduler does not know the weight of the
different tasks $p_j$. Therefore, when the work is split in two parts, we do not assume that
the work is split fairly (see for example Figure 3) but only that the number of tasks is split
in two equal parts.

Figure 3. Evolution of the repartition of tasks during one time step.
At time $t$, one processor has all the tasks. $p_1$ cannot be stolen since
processor 1 has already started executing it. After one work request done
by the second processor, one processor has 3 tasks and one has 2 tasks
but the workload may be very different, depending on the processing
times $p_j$.
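
The splitting rule can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the helper name `steal_half` is hypothetical, the first entry of the list stands for the task in execution, and the thief is assumed to take the last tasks of the list (the text only fixes how many tasks are transferred, not which ones).

```python
def steal_half(victim_tasks):
    """Split a victim's tasks by count only, ignoring processing times.

    victim_tasks[0] is the task currently executing and cannot be stolen.
    Returns (tasks kept by the victim, tasks given to the thief).
    """
    running, stealable = victim_tasks[:1], victim_tasks[1:]
    keep = len(stealable) // 2          # k tasks stay next to the running one
    return running + stealable[:keep], stealable[keep:]

if __name__ == "__main__":
    # Processing times of five tasks; the first one is being executed.
    kept, stolen = steal_half([1, 1, 1, 9, 9])
    print("victim:", kept, "workload", sum(kept))
    print("thief :", stolen, "workload", sum(stolen))
```

With these processing times the victim keeps 3 tasks and the thief receives 2, yet the thief ends up with a much larger workload, which is the situation depicted in Figure 3.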



6.1. Definition of the potential function and expected decrease. As the processing
times are unknown, the work cannot be shared evenly between both processors and can
be as bad as one processor getting all the smallest tasks and one all the biggest tasks (see
Figure 3). Let us call $w_i(t)$ the number of tasks possessed by the processor $i$. The potential
of the system at time $t$ is defined as:

$$\Phi_t = \sum_i \left(w_i(t)^\nu - w_i(t)\right). \tag{10}$$



During a work request, half of the tasks are transferred from an active processor to the
idle processor. If the processor $j$ is stealing tasks from processor $i$, the numbers of tasks
possessed by $i$ and $j$ at time $t + 1$ are $w_j(t + 1) = \lfloor w_i(t)/2 \rfloor$ and $w_i(t + 1) = \lceil w_i(t)/2 \rceil$.
Therefore, the decrease of potential is equal to the one of the cooperative steal of Equation (9)
for $k = 1$:

$$\delta_i \ge \left(1 - 2^{1-\nu}\right)\cdot\left(w_i(t)^\nu - w_i(t)\right).$$

Following the analysis of Section 5.2, this shows that on average:

$$\mathbb{E}[\Phi_{t+1}] \le \left(1 - \left(1 - 2^{1-\nu}\right) q(r)\right)\cdot\Phi_t. \tag{11}$$



6.2. Bound on the makespan. Equation (11) allows us to apply Theorem 1 to derive a
bound on the makespan of weighted tasks by the distributed list scheduling algorithm.
This bound differs from the one for unit tasks only by an additive term of $p_{\max}$.



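Theorem 5. Let $p_{\max} \stackrel{\text{def}}{=} \max_j p_j$ be the maximum processing time. The expected makespan
to schedule $n$ weighted tasks of total processing time $W = \sum_j p_j$ by DLS is bounded by

$$\mathbb{E}[C_{\max}] \le \frac{W}{m} + \frac{m-1}{m}\cdot p_{\max} + 3.24\cdot\left(\log_2 n + \frac{1}{2\ln 2}\right) + 1.$$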

Proof. Let $\Phi_t$ be the potential defined by Equation (10). At time $t = 0$, the potential of the
system is bounded by $n^\nu - n$. Therefore, by Theorem 1, the number of work requests
before $\Phi_t < 1$ is bounded by

$$m\cdot\lambda(\nu)\cdot\log_2 \Phi_0 + m\left(1 + \frac{\lambda(\nu)}{\ln 2}\right) \le m\cdot\nu\lambda(\nu)\cdot\log_2 n + m\left(1 + \frac{\lambda(\nu)}{\ln 2}\right),$$

where $\nu\lambda(\nu) < 3.24$ is the same constant as the bound for the unit tasks with the potential
function $\sum_i w_i^\nu$ of Theorem 3.






As $\Phi_t \in \mathbb{N}$, $\Phi_t < 1$ implies that $\Phi_t = 0$. Moreover, by definition of $\Phi_t$, this implies
that for all $i$: $w_i(t)^\nu - w_i(t) = 0$, which implies that for all $i$: $w_i(t) \le 1$. Therefore, once
$\Phi_t$ is equal to 0, there is at most one task per processor. This phase can last for at most
$p_{\max}$ units of time, generating at most $(m - 1)\,p_{\max}$ work requests. □



Remark. The same analysis applies for the cooperative stealing scheme of Section 5.2
leading to the same improved bound in $2.88 \log_2 n$ instead of $3.24 \log_2 n$.

7. Tasks with precedences 

In this section, we show how the well known non-blocking work stealing of Arora et al
(2001) (denoted ABP in the sequel) can be analyzed with our method which provides
tighter bounds for the makespan. We first recall the WS scheduler of ABP, then we show
how to define the amount of work on a processor $w_i(t)$, finally we apply the analysis of
Section 3 to bound the makespan.



7.1. ABP work-stealing scheduler. Following Arora et al (2001), a multithreaded
computation is modeled as a directed acyclic graph $G$ with $W$ unit tasks, whose edges define
precedence constraints. There is a single source task and the out-degree is at most 2. The
critical path of $G$ is denoted by $D$. ABP schedules the DAG $G$ as follows. Each processor $i$
maintains a double-ended queue (called a deque) $Q_i$ of ready tasks. At each slot, an active
processor $i$ with a non-empty deque executes the task at the bottom of its deque $Q_i$; once
its execution is completed, this task is popped from the bottom of the deque, enabling (i.e.
making ready) 0, 1 or 2 child tasks that are pushed at the bottom of $Q_i$. At each slot, an
idle processor $j$ with an empty deque $Q_j$ becomes a thief: it performs a steal request on
another randomly chosen victim deque; if the victim deque contains ready tasks, then its
top-most task is popped and pushed into the deque of one of its concurrent thieves. If $j$
becomes active just after its steal request, the steal request is said to be successful. Otherwise,
$Q_j$ remains empty and the steal request fails, which may occur in the three following
situations: either the victim deque $Q_i$ is empty; or $Q_i$ contains only one task, currently in
execution on $i$; or, due to contention, another thief performs a successful steal request on $i$
simultaneously.



7.2. Definition of $w_i(t)$. Let us first recall the definition of the enabling tree of Arora et al
(2001). If the execution of task $u$ enables task $v$, then the edge $(u, v)$ of $G$ is an enabling
edge. The sub-graph of $G$ consisting of only enabling edges forms a rooted tree called the
enabling tree. We denote by $h(u)$ the height of a task $u$ in the enabling tree. The root of
the DAG has height $D$. Moreover, it has been shown in Arora et al (2001) that tasks in the
deque have strictly decreasing height from top to bottom except for the two bottom most
tasks which can have equal heights.

We now define $w_i(t)$, the amount of work on processor $i$ at time $t$. Let $h_t$ be the
maximum height of all tasks in the deque. If the deque contains at least two tasks including
the one currently executing we define $w_i(t) = (2\sqrt{2})^{h_t}$. If the deque contains only one
task currently executing we define $w_i(t) = \frac{1}{2}\cdot(2\sqrt{2})^{h_t}$. The following lemma states that
this definition of $w_i(t)$ behaves similarly to the one used for the independent unit
tasks analysis of Section 4.

Lemma 1. For any active processor $i$, we have $w_i(t + 1) \le w_i(t)$. Moreover, after any
successful steal request from a processor $j$ on $i$, $w_i(t + 1) \le w_i(t)/2$ and $w_j(t + 1) \le
w_i(t)/2$, and if all steal requests are unsuccessful we have $w_i(t + 1) \le w_i(t)/\sqrt{2}$.

Proof. We first analyze the execution of one task $u$ at the bottom of the deque. Executing
task $u$ enables at most two tasks and these tasks are the children of $u$ in the enabling
tree. If the deque contains more than one task, the top most task has height $h_t$ and this
task is still in the deque at time $t + 1$. Thus the maximum height does not change and
$w_i(t) = w_i(t + 1)$. If the deque contains only one task, we have $w_i(t) = \frac{1}{2}\cdot(2\sqrt{2})^{h_t}$ and
$w_i(t + 1) \le (2\sqrt{2})^{h_t - 1}$. Thus $w_i(t + 1) \le w_i(t)$.

We now analyze a successful steal from processor $j$. In this case, the deque of processor
$i$ contains at least two tasks and $w_i(t) = (2\sqrt{2})^{h_t}$. The stolen task is one with the
maximum height and is the only task in the deque of processor $j$, thus $w_j(t + 1) = \frac{1}{2}\cdot(2\sqrt{2})^{h_t} \le
w_i(t)/2$. For the processor $i$, either its deque contains only one task after the steal with
height at most $h_t$ and $w_i(t + 1) \le \frac{1}{2}\cdot(2\sqrt{2})^{h_t} \le w_i(t)/2$, or there are still two or more
tasks and $w_i(t + 1) \le (2\sqrt{2})^{h_t - 1} \le w_i(t)/2$.

Finally, if all steal requests are unsuccessful, the deque of processor $i$ contains at most
one task. If the deque is empty, $w_i(t + 1) = w_i(t) = 0$ and thus $w_i(t + 1) \le w_i(t)/\sqrt{2}$.
If the deque contains exactly one task, $w_i(t) = \frac{1}{2}\cdot(2\sqrt{2})^{h_t}$ and $w_i(t + 1) \le (2\sqrt{2})^{h_t - 1}$,
thus $w_i(t + 1) \le w_i(t)/\sqrt{2}$. □
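
The definition of the amount of work and two cases of Lemma 1 can be illustrated with a small Python sketch. It assumes that a deque is represented only by the list of the heights of its tasks, from top to bottom and including the executing task; the helper names `work` and `steal_top` are hypothetical.

```python
import math

BASE = 2 * math.sqrt(2)

def work(heights):
    """Amount of work of a deque described by the heights of its tasks."""
    if not heights:
        return 0.0
    top = BASE ** max(heights)
    return top if len(heights) >= 2 else 0.5 * top

def steal_top(heights):
    """Move the top-most task (first entry) to the thief; return (victim, thief) deques."""
    return heights[1:], heights[:1]

if __name__ == "__main__":
    victim = [5, 4, 3]
    before = work(victim)
    victim_after, thief = steal_top(victim)
    print("successful steal:", work(victim_after) / before, work(thief) / before)
    # A deque holding one task of height 3 that enables two children of height 2.
    print("no steal        :", work([2, 2]) / work([3]))
```

In the second case the ratio is exactly $1/\sqrt{2}$, so the factor stated in the lemma for unsuccessful steals cannot be improved with this definition.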

7.3. Bound on the makespan. To study the number of steals, we follow the analysis
presented in Section 3 with the potential function $\Phi(t) = \sum_i w_i(t)^2$. Using results from
Lemma 1, we compute the decrease of the potential $\delta_i(t)$ due to steal requests on processor
$i$ by distinguishing two cases. If there is a successful steal from processor $j$,

$$\delta_i(t) = w_i(t)^2 - w_i(t + 1)^2 - w_j(t + 1)^2 \ge w_i(t)^2 - 2\cdot\left(\frac{w_i(t)}{2}\right)^2 \ge \frac{1}{2}\cdot w_i(t)^2.$$

If all steals are unsuccessful, the decrease of the potential is

$$\delta_i(t) = w_i(t)^2 - w_i(t + 1)^2 \ge w_i(t)^2 - \left(\frac{w_i(t)}{\sqrt{2}}\right)^2 \ge \frac{1}{2}\cdot w_i(t)^2.$$

In all cases, $\delta_i(t) \ge w_i(t)^2/2$. We obtain the expected potential at time $t + 1$ by summing
the expected decrease on each active processor:

$$\mathbb{E}[\Phi(t + 1) \mid \Phi(t)] \le \left(1 - \frac{q(r)}{2}\right)\cdot\Phi(t).$$

Finally, we can state the following theorem. 

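Theorem 6. On a DAG composed of $W$ unit tasks, with critical path $D$, one source and
out-degree at most 2, the makespan of ABP work stealing verifies:

(i) $\mathbb{E}[C_{\max}] \le \frac{W}{m} + \frac{3}{1 - \log_2\left(1 + \frac{1}{e}\right)}\cdot D + 1 < \frac{W}{m} + 5.5\cdot D + 1$

(ii) $\mathbb{P}\left\{C_{\max} \ge \frac{W}{m} + \frac{3}{1 - \log_2\left(1 + \frac{1}{e}\right)}\cdot\left(D + \log_2\frac{1}{\epsilon}\right) + 1\right\} \le \epsilon$.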

Proof. The proof is a direct application of Theorem 1. As in the initial step there is only
one non-empty deque containing the root task with height $D$, the initial potential is

$$\Phi(0) = \left(\frac{1}{2}\cdot(2\sqrt{2})^D\right)^2.$$

Thus the expected number of steal requests before $\Phi(t) < 1$ is bounded by

$$\begin{aligned}
\mathbb{E}[R] &\le \lambda\cdot m\cdot\log_2\left(\left(\frac{1}{2}\cdot(2\sqrt{2})^D\right)^2\right) + m\cdot\left(1 + \frac{\lambda}{\ln(2)}\right) \\
&\le 2\lambda\cdot m\cdot D\cdot\log_2(2\sqrt{2}) + m\cdot\left(1 + \frac{\lambda}{\ln(2)} - 2\lambda\right) \\
&\le 3\lambda\cdot m\cdot D \qquad \text{(as } 1 + \lambda/\ln(2) - 2\lambda < 0\text{)}
\end{aligned}$$

where $\lambda = \left(1 - \log_2(1 + 1/e)\right)^{-1}$ is the same constant as the bound for the unit tasks of
Section 4.

Moreover, when $\Phi(t) < 1$, we have $\forall i$, $w_i(t) < 1$. There is at most one task of height
0 in each deque, i.e. a leaf of the enabling tree which cannot enable any other task. This
last step generates at most $m - 1$ additional steal requests. In total, the expected number
of steal requests is bounded by $\mathbb{E}[R] \le 3\lambda\cdot m\cdot D + m - 1$. The bound on the makespan
is obtained using the relation $m\cdot C_{\max} = W + R$.

The proof of (i) applies mutatis mutandis to prove the bound in probability (ii). □
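
The constants appearing in this proof are easy to evaluate. The following sketch, whose function name is illustrative, computes $\lambda = (1 - \log_2(1 + 1/e))^{-1}$, checks the sign of $1 + \lambda/\ln 2 - 2\lambda$, and evaluates the bound on the number of steal requests for an initial potential $(\frac{1}{2}(2\sqrt{2})^D)^2$.

```python
import math

LAMBDA = 1.0 / (1.0 - math.log2(1.0 + 1.0 / math.e))

def steal_bound(D, m):
    """lambda * m * log2(Phi(0)) + m * (1 + lambda / ln 2) with Phi(0) = (1/2 * (2*sqrt(2))^D)^2."""
    log_phi0 = 2 * (D * math.log2(2 * math.sqrt(2)) - 1)
    return LAMBDA * m * log_phi0 + m * (1 + LAMBDA / math.log(2))

if __name__ == "__main__":
    print("lambda                   :", LAMBDA)
    print("3 * lambda               :", 3 * LAMBDA)
    print("1 + lambda/ln2 - 2lambda :", 1 + LAMBDA / math.log(2) - 2 * LAMBDA)
    print("bound for D=10, m=8      :", steal_bound(10, 8), "<=", 3 * LAMBDA * 8 * 10)
```

The correction term $1 + \lambda/\ln 2 - 2\lambda$ is negative but very close to zero, so dropping it costs almost nothing in the final constant.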



Remark. In Arora et al (2001), the authors established the upper bounds:

$$\mathbb{E}[C_{\max}] \le \frac{W}{m} + 32\cdot D \quad\text{and}\quad \mathbb{P}\left\{C_{\max} \ge \frac{W}{m} + 64\cdot D + 16\cdot\log_2\frac{1}{\epsilon}\right\} \le \epsilon$$

in Section 4.3, proof of Theorem 9. Our bounds greatly improve the constant factors of
this previous result.

8. Experimental study 

The theoretical analysis gives upper bounds on the expected value of the makespan
and on the deviation from the mean for the various models we considered. In this section, we
study experimentally the distribution of the makespan. Statistical tests give evidence that
the makespan for independent tasks follows a generalized extreme value (gev)
distribution (Kotz and Nadarajah 2001). This was expected since such a distribution arises when
dealing with the maximum of random variables. For tasks with dependencies, it depends on
the structure of the graph: DAGs with a short critical path still follow a gev distribution but
when the critical path grows, it tends to a gaussian distribution. We also study in more
detail the overhead to $W/m$ and show that it is approximately $2.37 \log_2 W$ for unit
independent tasks, which is close to the theoretical result of $3.24 \log_2 W$ (cf. Section 5).

We developed a simulator that strictly follows our model. At the beginning, all the tasks
are given to processor 0 in order to be in the worst case, i.e. when the initial potential $\Phi_0$ is
maximum. Each pair $(m, W)$ is simulated 10000 times to get accurate results, with a coefficient
of variation of about 2%.
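
A simulator of this kind can be sketched in a few lines of Python. The code below is not the authors' simulator: it is a minimal sketch for unit independent tasks under assumptions made for the illustration, namely that all tasks start on processor 0, that every active processor executes one task per time step, that each idle processor sends one request per step to another processor chosen uniformly at random, that a victim serves a single request per step, and that the served thief receives half (rounded down) of the tasks left after the execution step.

```python
import math
import random

def simulate(W, m, rng):
    """Run one schedule of W unit tasks on m processors; return (makespan, steal requests)."""
    w = [0] * m
    w[0] = W                                  # worst case: one processor holds everything
    makespan = steals = 0
    while sum(w) > 0:
        makespan += 1
        thieves = [i for i in range(m) if w[i] == 0]
        idle = set(thieves)
        w = [max(x - 1, 0) for x in w]        # active processors execute one task
        served = {}
        for j in thieves:
            steals += 1
            victim = rng.randrange(m - 1)     # uniform among the other processors
            if victim >= j:
                victim += 1
            served.setdefault(victim, j)      # contention: first request wins
        for victim, thief in served.items():
            if victim not in idle:            # idle victims have nothing to give
                share = w[victim] // 2
                w[victim] -= share
                w[thief] += share
    return makespan, steals

if __name__ == "__main__":
    W, m, runs = 2 ** 12, 32, 200
    rng = random.Random(0)
    overheads = [simulate(W, m, rng)[0] - math.ceil(W / m) for _ in range(runs)]
    mean = sum(overheads) / runs
    print("mean overhead:", mean, " overhead / log2(W):", mean / math.log2(W))
```

Repeating the run for several values of $W$ and regressing the overhead against $\log_2 W$ gives an empirical constant that can be compared with the theoretical ones; the value obtained depends on the modelling choices listed above.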

8.1. Distribution of the makespan. We consider here a fixed workload W = 2^'' onm ^ 
2^° processors for independent tasks and to = 2^" processors for tasks with dependencies. 
For the weighted model, processing times were generated randomly and uniformly between
1 and 10. For the DAG model, graphs have been generated using a layer by layer method.
We generated two types of DAGs, one with a short critical path (close to the minimum
possible $\log_2 W$) and the other one with a long critical path (around $W/4m$ in order to
keep enough tasks per processor per layer). Fig. 4 presents histograms for $C_{\max} - \lceil W/m \rceil$.
Figure 4. Distribution of the makespan for unit independent tasks 4(a),
weighted independent tasks 4(b) and tasks with dependencies 4(c)
and 4(d). Panels: (a) Unit Tasks, (b) Weighted Tasks, (c) DAG (short D),
(d) DAG (long D). The first three models follow a gev distribution (blue curves),
the last one is gaussian (red curve).

The distributions of the first three models (a, b, c in Fig. 4) are clearly not gaussian: they
are asymmetrical with a heavier right tail. To fit these three models, we use the
generalized extreme value (gev) distribution (Kotz and Nadarajah 2001). In the same way as the
normal distribution arises when studying the sum of independent and identically distributed
(iid) random variables, the gev distribution arises when studying the maximum of iid
random variables. The extreme value theorem, an equivalent of the central limit theorem for
maxima, states that the maximum of iid random variables converges in distribution to a
gev distribution. In our setting, the random variables measuring the load of each processor
are not independent, thus the extreme value theorem cannot apply directly. However, it is
possible to fit the distribution of the makespan to a gev distribution. In Fig. 4, the fitted
distributions (blue curve) closely follow the histograms. To confirm this graphical approach,
we performed a goodness of fit test. The $\chi^2$ test is well-suited to our data because the
distribution of the makespan is discrete. We compared the results of the best fitted gev to
the best fitted gaussian. The $\chi^2$ test strongly rejects the gaussian hypothesis but does not
reject the gev hypothesis with a p-value of more than 0.5. This confirms that the makespan
follows a gev distribution. We fitted the last model, DAG with long critical path, with a
gaussian (red curve in Fig. 4(d)). In this last case, the completion time of each layer of the
DAG should correspond to a gev distribution but the total makespan, the sum over all layers,
should tend to a gaussian by the central limit theorem. Indeed the $\chi^2$ test does not reject
the gaussian hypothesis with a p-value around 0.3.

8.2. Study of the $\log_2 W$ term. We focus now on unit independent tasks as the other
models rely on too many parameters (the choice of the processing times for weighted tasks
and the structure of the DAG for tasks with dependencies). We want to show that the
number of work requests is proportional to $\log_2 W$ and study the proportionality constant.
We first launch simulations with a fixed number of processors $m$ and a wide range of work
in successive powers of 10. A linear regression confirms the linear dependency in $\log_2 W$
with a coefficient of determination ("r squared") greater than 0.9999.

Then, we obtain the slope of the regression for various numbers of processors. The value
of the slope tends to a limit around 2.37 (cf. Fig. 5 (left)). This shows that the theoretical
analysis of Theorem 3 is almost accurate with a constant of approximately 3.24. We also
study the constant factor of $\log_2 W$ for the cooperative steal of Section 5. The theoretical
value of 2.88 is again close to the value obtained by simulation, 2.08 (cf. Figure 5 (left)).
The difference between the theoretical and the practical values can be explained by the
worst case analysis on the number of steal requests per time step in Theorem 1.

Figure 5. (Left) Constant factor of $\log_2 W$ against the number of
processors for the standard steal and the cooperative steal. (Right) Ratio of
steal requests (standard/cooperative).

Moreover, simulations in Fig. 5 (right) show that the ratio of steal requests between
standard and cooperative steals goes asymptotically to 14%. The ratio between the two
corresponding theoretical bounds is about 12%. This indicates that the bias introduced by
our analysis is systematic and thus, our analysis may be used as a good prediction while
using cooperation among thieves.



9. Concluding Remarks 

In this paper, we presented a complete analysis of the cost of distribution in list
scheduling. We proposed a new framework, based on potential functions, for analyzing the
complexity of distributed list scheduling algorithms. In all variants of the problem, we
succeeded in characterizing precisely the overhead due to the decentralization of the list.
These results are summarized in the following table comparing makespans for standard
(centralized) and decentralized list scheduling.

$$\frac{W}{m} + \frac{m-1}{m}\cdot D$$

In particular, in the case of independent tasks, the overhead due to the distribution is
small and only depends on the number of tasks and not on their weights. In addition, this
analysis improves the bounds for the classical work stealing algorithm of Arora et al (2001)
from $32D$ to $5.5D$. We believe that this work should help to clarify the links between
classical list scheduling and work stealing.

Furthermore, the framework to analyze DLS algorithms described in this paper is more
general than the method of Arora et al (2001). Indeed, we do not assume a specific rule
(e.g. depth first execution of tasks) to manage the local lists. Moreover, we do not refer
to the structure of the DAG (e.g. the depth of a task in the enabling tree) but to the
work contained in each list. Thus, we plan to extend this analysis to the case of general
precedence graphs.

Acknowledgements. The authors would like to thank Julien Bernard and Jean-Louis